ABSTRACT


Autonomous agents deployed in the real world need to be robust against adversarial attacks on
sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible.
We demonstrate that existing observation-space attacks on reinforcement learning agents have a
common weakness: while effective, their lack of information-theoretic detectability constraints
makes them detectable using automated means or human inspection. Detectability is undesirable to
adversaries as it may trigger security escalations. We introduce ε-illusory attacks, a novel form
of adversarial attack on sequential decision-makers that is both effective and of ε-bounded
statistical detectability. We propose a novel dual ascent algorithm to learn such attacks
end-to-end. Compared to existing attacks, we empirically find ε-illusory attacks to be
significantly harder to detect with automated methods, and a small study with human participants
suggests they are similarly harder to detect for humans. Our findings suggest the need for better
anomaly detectors, as well as effective hardware- and system-level defenses. The project website
can be found at https://tinyurl.com/illusory-attacks


1 INTRODUCTION 


The sophistication of attacks on cyber-physical systems is increasing, driven in no small part by
the proliferation of increasingly powerful commercial cyber attack tools (NSCS, 2023). AI-driven
technologies, such as virtual reality systems and large-language model assistants, are opening up
additional attack surfaces. Further examples are deep learning methods in autonomous driving tasks
(2022), deep reinforcement learning methods for robotics (Todorov et al.; Andrychowicz et al.,
2020), and nuclear fusion (Degrave et al., 2022). While AI can be used for cyber defense, the
threat from automated AI-driven cyber attacks is thought to be significant (Buchanan et al., 2023)
and the future balance between automated attacks and defenses hard to predict.

Beyond its beneficial use, deep reinforcement learning has also been proposed as a method for
learning flexible automated attacks on AI-driven sequential decision makers. A common approach to
countering adversarial attacks is to use policy robustification (Kumar et al., 2021; 2021). This
approach can be effective, as visualized by the red-circled budgets in Fig. 1. However, as we show
in this work, for observation-space attacks with larger budgets (grey circles in Fig. 1),
robustification can be ineffective. The practical feasibility of large budget attacks has been
highlighted in domains such as visual sensor attacks (Cao et al., patch attacks), as well as botnet
evasion attacks (2021). This highlights the importance of a two-step defense process in which the
first step employs anomaly detection (2023), followed by attack-mitigating security escalations.
This coincides with common cybersecurity practice, where intrusion detection systems allow for the
implementation of mitigating contingency actions as a defense strategy (Cazorla et al., 2018).
Therefore, effective cyber attackers are known to prioritize detection avoidance (Langner, 2011,
STUXNET 417 attack).

Figure 1: We see adversary performance (reduction in the victim’s reward) mapped against the KL
divergence between the unattacked training and the attacked test distribution. Attacks with a small 
L2 attack budget (indicated by small circles) can be defended against using randomized smoothing, 
and attacks with a large KL divergence can be defended against by triggering contingency options 
upon detection of the attack (purple shaded area). Illusory attacks (blue) can achieve significantly 
higher performance than classic adversarial attacks (black), as they allow to limit the KL divergence 
and thereby avoid detection. 


In this paper, we study the information-theoretic limits of the detectability of automated attacks on 
cyber-physical systems. To this end, we introduce a novel observation-space illusory attack frame- 
work, which imposes a novel information-theoretic detectability constraint on adversarial attacks 
that is grounded in information-theoretic steganalysis (Cachin, 1998). Unlike existing frameworks,
the illusory attack framework naturally allows attackers to exploit environment stochasticity in order
to generate effective attacks that are hard (ε-illusory), or even impossible (perfect illusory) to detect.


We propose a theoretically-grounded dual ascent algorithm and scalable estimators for learning il- 
lusory attacks. On a variety of RL benchmark problems, we show that illusory attacks can exhibit 
much better performance against victim agents equipped with state-of-the-art detectors than conven- 
tional attacks. Lastly, in a controlled study with human participants, we demonstrate that illusory 
attacks can be significantly harder to detect visually than existing attacks, owing to their seeming 
preservation of physical dynamics. Our findings suggest that software-level defenses against au- 
tomated attacks alone might not be sufficiently effective, and that system-wide and hardware-level 
robustification may be required for adequate security protection (Wylde, 2021). We also suggest that
better anomaly detectors for sequential-decision-making agents should be developed. 


Our work makes the following contributions: 


• We formalize the novel illusory attack framework with information-theoretically grounded
attack detectability constraints.

• We propose a dual ascent algorithm and scalable estimator to learn illusory attacks in
high-dimensional control environments.

• We show that illusory attacks can be effective against victims with state-of-the-art
out-of-distribution detectors, whereas existing attacks can be detected and hence are ineffective.

• We show that illusory attacks are significantly harder to detect by human visual inspection.


2 RELATED WORK 


Please see Appendix A.1 for additional related work.


The adversarial attack literature originates in image classification (Szegedy et al., 2013), where
attacks commonly need to be visually imperceptible. Visual imperceptibility is commonly proxied
by simple pixel-space minimum-norm perturbation (MNP) constraints (Goodfellow et al., 2014;
Madry et al., 2023). Several defenses against MNP attacks have been proposed (Das et al.; 2018;
Xu et al., 2023). Various strands of security research concern adversarial patch (AP) attacks that
do not require access to all the sensor pixels, and commonly assume that the attack target can be
physically modified (2021). Illusory attacks differ from both MNP and AP attacks in that they are
information-theoretically grounded and undetectable even for large budgets.


MNP attacks have been extended to adversarial attacks on sequential decision-making agents
(Chen et al., 2019b; Ilahi et al., 2021; 2021). In the sequential MNP framework, the adversary can
modify the victim’s observations up to a step- or episode-wise perturbation budget, in both
white-box and black-box settings. Prior work, including Sun et al. (2021), uses reinforcement
learning to learn adversarial policies that require only black-box access to the victim policy.
Work towards robust sequential decision-making uses techniques such as randomized smoothing
(Kumar et al., 2021; 2021), test-time hardening by computing confidence bounds (Everett et al.,
2021), training with adversarial loss functions (Oikarinen et al., 2021), and co-training with
adversarial agents (Zhang et al., 2021a; 2020; 2022). We compare against and build upon this work.


Another body of work focuses on detection and detectability of learnt adversarial attacks on
sequential decision makers. Perhaps most closely related to our work, (2022) study action-space
attacks on low-dimensional stochastic control systems and consider information-theoretic detection
(2014) based on stochastic equivalence between the resulting trajectories. We instead investigate
high-dimensional observation-space attacks, and consider learned detectors, as well as humans.


AI-driven attacks on humans and human-operated infrastructure, such as social networks, are an
active area of research (Tsipras et al., 2018). Ye & Li (2020) consider data privacy and security
issues in the age of personal human assistants, and investigate automated social
engineering attacks on professional social networks using chatbots. Illusory attacks indicate that
such automated attacks may be learnt so as to be hard to detect, or indeed undetectable.


Within information-theoretic hypothesis testing, Bayesian optimal experimental design (1995)
studies optimisation objectives that share similarities with the illusory attack objective. Prior
work introduces several classes of fast expected information gain (EIG) estimators by building on
ideas from amortized variational inference, and uses deep reinforcement learning for sequential
Bayesian experiment design.


3 BACKGROUND AND NOTATION 


We denote a probability distribution over a set $\mathcal{X}$ as $\mathcal{P}(\mathcal{X})$, and an
unnamed probability distribution as $\mathcal{P}(\cdot)$. The empty set is denoted by $\emptyset$,
the indicator function by $\mathbb{1}$, and the Dirac delta function by $\delta(\cdot)$. Kleene
closures are denoted by $(\cdot)^*$. For ease of exposition, we restrict our theoretical
treatment to probability distributions of finite support where not otherwise indicated.


3.1 MDP AND POMDP. 


A Markov decision process (MDP) (Bellman, 1958) is a tuple
$\langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, where $\mathcal{S}$ is the finite²
non-empty state space, $\mathcal{A}$ is the finite non-empty action space,
$p : \mathcal{S} \times \mathcal{A} \mapsto \mathcal{P}(\mathcal{S})$ is the probabilistic
state-transition function, and $r : \mathcal{S} \times \mathcal{A} \mapsto \mathcal{P}(\mathbb{R})$
is a lower-bounded reward function. Starting from a state $s_t \in \mathcal{S}$ at time $t$, an
action $a_t \in \mathcal{A}$ taken by the agent policy
$\pi : \mathcal{S} \mapsto \mathcal{P}(\mathcal{A})$ effects a transition to state
$s_{t+1} \sim p(\cdot|s_t, a_t)$ and the emission of a reward $r_{t+1} \sim r(\cdot|s_{t+1}, a_t)$.
The initial system state at time $t = 0$ is drawn as $s_0 \sim p(\cdot|\emptyset)$. For simplicity,
we consider episodes of infinite horizon and hence introduce a discount factor $0 < \gamma < 1$.
In a partially observable MDP (1965; Kaelbling et al., 1998, POMDP)
$\langle \mathcal{S}, \mathcal{A}, \Omega, \mathcal{O}, p, r, \gamma \rangle$, the agent does not
directly observe the system state $s_t$ but instead receives an observation
$o_t \sim \mathcal{O}(\cdot|s_t)$, where $\mathcal{O} : \mathcal{S} \mapsto \mathcal{P}(\Omega)$ is
an observation function and $\Omega$ is a finite non-empty observation space. In line with standard
literature (Monahan, 1982), we disambiguate two stochastic processes that are induced by pairing a
POMDP with a policy $\pi$: the core process, which is the process over state random variables
$\{S_t\}$, and the observation process induced by observation random variables $\{O_t\}$. Please
see the Appendix for a more detailed exposition on POMDPs.

² For conciseness, we restrict our exposition to finite state, action and observation spaces.
Results carry over to continuous state-action-observation spaces under some technical conditions
that we omit.


3.2 OBSERVATION-SPACE ADVERSARIAL ATTACKS. 


Observation-space adversarial attacks consider the scenario where an adversary manipulates the
observation of a victim at test-time. Much prior work falls within the SA-MDP framework
(Zhang et al., 2020), in which an adversarial agent with policy
$\xi : \mathcal{S} \mapsto \mathcal{P}(\mathcal{S})$ generates adversarial observations
$o_t \sim \xi(s_t)$. The perturbation is bounded by a budget
$B : \mathcal{S} \mapsto 2^{\mathcal{S}}$, limiting $\mathrm{supp}\, \xi(\cdot|s) \subseteq B(s)$.
For simplicity, we consider only zero-sum adversarial attacks, where the adversary minimizes the
expected return of the victim. In the case of additive perturbations (2021),
$\mathcal{S} := \mathbb{R}^d$, $d \in \mathbb{N}$, $\varphi_t \in \mathbb{R}^d$, and
$\xi(s_t) := \delta(o_t)$. Here, $o_t := s_t + \varphi_t$, subject to a real positive per-step
perturbation budget $B$ such that $\|\varphi_t\|_2^2 \le B^2, \forall t$.
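
As a minimal sketch of the additive perturbation model above, the following Python snippet rescales a proposed perturbation so that it respects the per-step L2 budget before it is added to the state. The function name `perturb_observation` and the example numbers are illustrative assumptions and are not taken from the SA-MDP literature.

```python
import math
from typing import List


def perturb_observation(state: List[float], phi: List[float], budget: float) -> List[float]:
    """Return o_t = s_t + phi_t after rescaling phi_t so that ||phi_t||_2 <= budget."""
    norm = math.sqrt(sum(x * x for x in phi))
    # Rescale only if the proposed perturbation exceeds the per-step budget B.
    scale = 1.0 if norm <= budget else budget / norm
    return [s + scale * x for s, x in zip(state, phi)]


# Hypothetical example: a perturbation of norm 5 is shrunk to norm 0.2.
obs = perturb_observation([0.0, 1.0], [3.0, 4.0], budget=0.2)
print(obs)
```

Rescaling (rather than clipping each coordinate) keeps the direction of the proposed perturbation, which is one of several reasonable ways to enforce the constraint.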


3.3. INFORMATION-THEORETIC HYPOTHESIS TESTING 


Following (Blahut, 1987; 1998), we assume two probability distributions $P_1$ and $P_2$ over
the space $\mathcal{Q}$ of possible measurements. Given a measurement $Q \in \mathcal{Q}$, we let
hypothesis $H_0$ be true if $Q$ was generated from $P_1$, and $H_1$ if $Q$ was generated from $P_2$.
A decision rule is then a binary partition of $\mathcal{Q}$ that assigns each element
$q \in \mathcal{Q}$ to one of the two hypotheses. Let $\alpha$ be the type I error of accepting
$H_1$ when $H_0$ is true, and $\beta$ be the type II error of accepting $H_0$ when $H_1$ is true.
By the Neyman-Pearson theorem, the optimal decision rule is given by assigning $q$ to $H_0$ iff the
log-likelihood ratio $\log(P_1(q)/P_2(q)) \ge T$, where $T \in \mathbb{R}$ is chosen according to
the maximum acceptable $\beta$. For a sequence of measurements $q_t$, this decision rule can be
extended to testing whether $\sum_t \log(P_1(q_t)/P_2(q_t)) \ge T$. It can further be shown
(Blahut, 1987) that $d(\alpha, \beta) \le \mathrm{KL}(P_1\|P_2)$, where
$\mathrm{KL}(Q\|P) = \mathbb{E}_Q[\log Q - \log P]$ is the Kullback-Leibler divergence between two
probability distributions $Q$ and $P$, and
$d(\alpha, \beta) = \alpha(\log\alpha - \log(1-\beta)) + (1-\alpha)(\log(1-\alpha) - \log\beta)$
is the binary relative entropy. Note that if $\mathrm{KL}(P_1\|P_2) = 0$, then
$\alpha = \beta = \frac{1}{2}$ and therefore $H_0$ cannot be better distinguished from $H_1$ than
by random guessing. Hence $H_0$ and $H_1$ are information-theoretically indistinguishable if
$\mathrm{KL}(P_1\|P_2) = 0$.
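
To make the bound $d(\alpha, \beta) \le \mathrm{KL}(P_1\|P_2)$ concrete, the sketch below evaluates both sides for a small finite-support example. The two distributions and the acceptance region are hypothetical values chosen for illustration, and the helper names are assumptions, not notation from the cited works.

```python
import math
from typing import Dict, Set, Tuple


def kl(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats for finite-support distributions (q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / q[x]) for x, pi in p.items() if pi > 0.0)


def binary_relative_entropy(alpha: float, beta: float) -> float:
    """d(alpha, beta) as defined in the text, for 0 < beta < 1 and 0*log(0) := 0."""
    total = 0.0
    if alpha > 0.0:
        total += alpha * (math.log(alpha) - math.log(1.0 - beta))
    if alpha < 1.0:
        total += (1.0 - alpha) * (math.log(1.0 - alpha) - math.log(beta))
    return total


def error_rates(p1: Dict[str, float], p2: Dict[str, float],
                accept_h0: Set[str]) -> Tuple[float, float]:
    """Type I error: accept H1 under P1. Type II error: accept H0 under P2."""
    alpha = sum(v for x, v in p1.items() if x not in accept_h0)
    beta = sum(v for x, v in p2.items() if x in accept_h0)
    return alpha, beta


p1 = {"a": 0.5, "b": 0.3, "c": 0.2}  # distribution under H0 (hypothetical)
p2 = {"a": 0.2, "b": 0.3, "c": 0.5}  # distribution under H1 (hypothetical)
alpha, beta = error_rates(p1, p2, accept_h0={"a", "b"})
print(binary_relative_entropy(alpha, beta), kl(p1, p2))
```

The smaller the divergence between the two distributions, the less room any decision rule has to achieve low error rates of both types at once.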


4 ILLUSORY ATTACKS 


4.1 THE ILLUSORY ATTACK FRAMEWORK 


We introduce a novel illusory attack framework in which an adversary attacks a victim acting in the
environment $\mathcal{E}$ at test time, thus inducing a two-player zero-sum game $\mathcal{G}$
(Von Neumann & Morgenstern, 1944). Our work assumes that the following facts about $\mathcal{G}$
are commonly known (Halpern & Moses, 1990) by both adversary and victim: At test time, the
adversary performs observation-space attacks (see Sec. 3.2) on the victim. The victim can sample
from the environment shared with an arbitrary adversary at train time, but has no certainty over
which specific test-time policy the adversary will choose. The adversary can sample from the
environment shared with an arbitrary victim at train time, but has no certainty over which specific
test-time policy the victim will choose. The task of the victim is to act optimally with respect to
its expected test-time return, while the task of the adversary is to minimise the victim’s expected
test-time return.


We follow (2023) in assuming that the victim’s reward signal is endogenous (2009), which means it
depends on the victim’s action-observation history and is not explicitly modeled at test-time,
thereby exposing it to manipulation by the adversary. Additionally, environments of interest
frequently emit sparse or delayed reward signals that aggravate the task of detecting an attacker
before catastrophic damage is inevitable (Sutton & Barto, 2018; 2023).


Assuming the victim’s policy
$\pi_v : (\mathcal{O} \times \mathcal{A})^* \mapsto \mathcal{P}(\mathcal{A})$ conducts adversary
detection using information-theoretically optimal sequential hypothesis testing on its
action-observation history (see Section 3.3), the state of the adversary’s MDP must contain the
action-observation history of the victim. The adversary’s policy
$\nu : \mathcal{S} \times (\mathcal{O} \times \mathcal{A})^* \mapsto \mathcal{P}(\mathcal{O})$
therefore conditions on both the state of the unattacked MDP, as well as the victim’s
action-observation history. This turns the victim’s test-time decision process into a POMDP with an
infinite state space, making the game $\mathcal{G}$ difficult to solve with game-theoretic means
(see Appendix A.2).

Figure 2: Left: The unattacked MDP with an expected victim return of 1. Right: A regular
adversarial attack and a perfect illusory attack, with an expected victim return of 0 and
$\frac{1}{3}$, respectively. The perfect illusory attack chooses observations $o_0$ such that the
KL divergence between the attacked and unattacked observation distribution is zero.

Figure 3: Empirical results for the 1-step MDP defined in Figure 2. The adversary’s expected
return increases with increasing $\epsilon$. At the same time, the empirical trajectory KL
constraint tightly controls the adversary policy’s detectability to within $\epsilon$. The purple
line indicates the adversary’s attack return ceiling at 0.0.


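In the illusory attack framework, the trajectory density induced by the adversary’s MDP is given by

$$\rho_a(\cdot) = p(s_0|\emptyset)\, \nu(o_0|s_0)\, \pi_v(a_0|o_0) \prod_{t=1}^{T} p(s_t|s_{t-1}, a_{t-1})\, \nu(o_t|s_t, o_{<t}, a_{<t})\, \pi_v(a_t|o_{\le t}, a_{<t}). \tag{1}$$

The trajectory density of the victim’s observation process (see Sec. 3.1) in the attacked
environment is given by

$$\rho_v(\cdot, \nu) = \sum_{s_0 \ldots s_T} \rho_a(\cdot, s_0 \ldots s_T). \tag{2}$$

Note that $\rho_v(\cdot, \mathbb{1}_{o_t = s_t})$ reduces to the trajectory density of the
unattacked environment

$$\rho_v(\cdot) = \rho_v(\cdot, \mathbb{1}_{o_t = s_t}) = p(s_0|\emptyset)\, \pi_v(a_0|s_0) \prod_{t=1}^{T} p(s_t|s_{t-1}, a_{t-1})\, \pi_v(a_t|s_{\le t}, a_{<t}). \tag{3}$$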


4.2 THE ILLUSORY OPTIMISATION OBJECTIVE 


At test-time, the adversary assumes that the victim is employing an information-theoretically
optimal decision rule in order to discriminate between the hypotheses that an adversary is present,
or not (see Section 3.3). At each test-time step, the victim only has access to an empirical
distribution $\hat{\rho}_v(\cdot, \nu)$ based on the $N$ test-time samples collected so far, which
constrains the power of its hypothesis test.

We here assume that the adversary does not know how many test-time samples the victim can collect,
but has sampling access to the victim’s test-time policy $\pi_v$. Therefore, in order to degrade
the victim’s decision rule performance, the adversary aims to ensure that the KL divergence between
$\rho_v(\cdot, \nu)$ and $\rho_v(\cdot)$ is smaller than a detectability threshold $\epsilon$. To
maximise attack strength, the adversary would choose the highest $\epsilon$ that warrants
undetectability, i.e., renders the victim agent unable to distinguish between the observed
trajectory distribution of the attacked and unattacked environment.


We now define information-theoretically optimal adversarial attacks (ε-illusory attacks) for a
given detectability threshold $\epsilon$. We set the direction of the KL divergence analogously to
Cachin (1998).

Definition 4.1 (ε-illusory attacks). An ε-illusory attack is an adversarial attack $\nu^*$ which
minimizes the victim reward, subject to $\mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu)) \le \epsilon$:

$$\nu^* = \arg\inf_{\nu} \mathbb{E}_{\tau \sim \rho_a}[R_t], \quad \text{s.t.} \quad \mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu)) \le \epsilon. \tag{4}$$

The ε-illusory attack objective³ therefore aims to train an adversary that reduces the victim’s
expected cumulative return, while keeping its observed trajectory distribution ε-close to the one
it would have observed in the unattacked environment.


We refer to illusory attacks that satisfy $\epsilon = 0$ as perfect illusory attacks. In this case,
to the victim, the presence of the adversary induces a POMDP with infinite state-space (see
Appendix A.2), in which the core process over MDP states (see Section 3.1) differs, but the
observation process is statistically indistinguishable from the state-transition dynamics of the
unattacked MDP. Importantly, as the illusory KL constraint is distributional, the adversary can
learn stochastic adversarial attack policies that are not restricted to the identity function.

Definition 4.2 (Perfect illusory attacks). A perfect illusory attack is any undetectable
non-trivial adversarial attack $\nu$, i.e. any $\nu$ for which $\nu \neq \mathbb{1}_{o_t = s_t}$
and $\mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu)) = 0$.


Example. We now build up some intuition over the meaning of illusory attacks by studying a
simple single-step stochastic control environment (Figure 2). The environment is assigned one of
two initial states with probabilities $\frac{1}{3}$ and $\frac{2}{3}$, respectively. In the
unattacked environment (Figure 2, left), the victim can observe the initial state $s_0$, while
under an adversarial attack, it observes $o_0$ (see right side). Given its observation, the victim
chooses between two actions, upon which the environment terminates and a scalar reward is issued.
The reward conditions on the initial state and the victim’s action. Without undetectability
constraints, the optimal observation-space attack always generates observations fooling the victim
over the initial state (Regular Adversary in Figure 2). This, however, changes the victim’s
observed initial state distribution, which makes the attack detectable. In contrast, a perfect
illusory attack only fools the victim half of the time when in the second initial state, and always
when in the first initial state, as this does not change the victim’s observed initial state
distribution. Note that attack undetectability comes at the cost of a higher expected victim
return of $\frac{1}{3}$ for the perfect illusory attack, compared to 0 return under the regular
adversarial attack.
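
The example can be checked numerically. The sketch below assumes initial-state probabilities of 1/3 and 2/3 (the values for which the described attack leaves the observed distribution unchanged) and assumes a reward of 1 when the victim is not fooled and 0 otherwise, which is consistent with the returns of 1 and 0 stated for the unattacked and regularly attacked cases. All function names are illustrative.

```python
import math
from typing import Dict

P_S = {1: 1 / 3, 2: 2 / 3}  # assumed initial-state distribution


def observed_dist(flip_prob: Dict[int, float]) -> Dict[int, float]:
    """Distribution of o_0 when state s is replaced by the other state w.p. flip_prob[s]."""
    dist = {1: 0.0, 2: 0.0}
    for s, p in P_S.items():
        other = 2 if s == 1 else 1
        dist[other] += p * flip_prob[s]
        dist[s] += p * (1.0 - flip_prob[s])
    return dist


def kl(p: Dict[int, float], q: Dict[int, float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / q[x]) for x, pi in p.items() if pi > 0.0)


def victim_return(flip_prob: Dict[int, float]) -> float:
    """Assumed reward: 1 if the victim is not fooled, 0 otherwise."""
    return sum(p * (1.0 - flip_prob[s]) for s, p in P_S.items())


regular = {1: 1.0, 2: 1.0}   # always fool the victim
illusory = {1: 1.0, 2: 0.5}  # always fool in state 1, half of the time in state 2

for name, attack in [("regular", regular), ("perfect illusory", illusory)]:
    print(name, victim_return(attack), kl(P_S, observed_dist(attack)))
```

Under these assumptions the regular attack yields a strictly positive divergence, whereas the perfect illusory attack reproduces the unattacked observation distribution up to floating-point error.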


4.3 DUAL-ASCENT FORMULATION 


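To solve the ε-illusory attack objective (see Def. 4.1), we propose the following dual-ascent
algorithm (Boyd & Vandenberghe, 2004) with learning rate hyper-parameter
$\alpha_k \in \mathbb{R}_+$:

$$\nu_{k+1} = \arg\inf_{\nu} \mathbb{E}_{\tau \sim \rho_a}[R_t] + \lambda_k \left[\mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu)) - \epsilon\right],$$
$$\lambda_{k+1} = \max\left(\lambda_k + \alpha_k \left[\mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu_{k+1})) - \epsilon\right], 0\right). \tag{5}$$

This algorithm alternates between policy updates, which minimise the Lagrangian of Eq. 4, and
$\lambda$ updates. When the KL constraint is violated, $\lambda$ increases, which strengthens the
influence of the KL constraint in the policy update objective. Note that $\lambda_0$ has to be
initialized heuristically.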
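
The alternation between policy and dual updates can be sketched on the one-step example. The snippet below is an illustration under several assumptions: the adversary always fools the victim in the first state and is parameterised only by the flip probability `f2` in the second state, the policy update is an exhaustive search over a grid rather than a learned policy, and the values of `EPSILON`, `ALPHA` and the initial multiplier are arbitrary choices.

```python
import math

P_S = (1 / 3, 2 / 3)   # assumed initial-state probabilities of the one-step example
EPSILON = 0.05         # detectability threshold (assumed value)
ALPHA = 2.0            # dual learning rate (assumed value)
GRID = [i / 1000 for i in range(1, 1001)]  # candidate flip probabilities in state 2


def victim_return(f2: float) -> float:
    """State 1 is always fooled (return 0); state 2 is fooled with probability f2."""
    return P_S[1] * (1.0 - f2)


def kl_to_attacked(f2: float) -> float:
    """KL(unattacked || attacked) between the observed initial-state distributions."""
    q1 = P_S[1] * f2   # probability of observing state 1 under attack
    q = (q1, 1.0 - q1)
    return sum(p * math.log(p / qi) for p, qi in zip(P_S, q))


lam = 0.0              # lambda_0, chosen heuristically
for _ in range(200):
    # Policy update: minimise victim return plus the lambda-weighted constraint violation.
    f2 = min(GRID, key=lambda f: victim_return(f) + lam * (kl_to_attacked(f) - EPSILON))
    # Dual update: raise lambda while the KL constraint is violated, never below zero.
    lam = max(lam + ALPHA * (kl_to_attacked(f2) - EPSILON), 0.0)

print(f2, victim_return(f2), kl_to_attacked(f2), lam)
```

In this toy setting the multiplier settles where the measured divergence sits close to the threshold, and the resulting victim return lies between that of the unconstrained attack and that of the perfect illusory attack.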


4.4 ESTIMATING THE KL-OBJECTIVE 


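Accurately estimating the KL objective in Def. 4.1 is, in general, a computationally complex
problem due to its nested form and the large support of $\rho_v(\cdot)$ and $\rho_v(\cdot, \nu)$
(see also Appendix A.3). We write

$$\mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu)) = \mathbb{E}_{\rho_v(\cdot)}\left[\log \frac{\rho_v(\cdot)}{\rho_v(\cdot,\nu)}\right] = H[\rho_v(\cdot), \rho_v(\cdot,\nu)] - H[\rho_v(\cdot)], \tag{6}$$

where $H[\rho_v(\cdot)]$ is the entropy, and $H[\rho_v(\cdot), \rho_v(\cdot,\nu)]$ is the
cross-entropy (Murphy, 2012, p. 953).

We now explicitly construct an estimator for the cross-entropy term. Let
$A = \prod_{t} \pi_v(a_t|o_{\le t}, a_{<t})$. Then,
$\rho_v(\cdot) = A \cdot p(o_0|\emptyset) \prod_{t} p(o_{t+1}|o_t, a_t)$, and

$$\rho_v(\cdot,\nu) = A \cdot \mathbb{E}_{s_0}\left[\nu(o_0|s_0)\, \mathbb{E}_{s_1}\left[\nu(o_1|s_1, o_0, a_0)\, \mathbb{E}_{s_2}\left[\nu(o_2|s_2, o_{<2}, a_{<2}) \cdots \right]\right]\right]. \tag{7}$$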


³ Note that the ε-illusory attack objective differs from a standard constrained MDP (2021, CMDP)
problem in that the illusory constraint cannot be expressed as a discounted sum over
state-transition costs (Achiam et al., 2017, CPO), but instead depends on trajectory densities.

Figure 4: We display normalised adversary scores, indicating the reduction in the victim’s reward,
on the y-axis. Each plot shows results in different environments, with different adversarial attacks
on the x-axis. We show both the raw adversary score, as well as the adversary score adjusted
for detection rates of different adversarial attacks (see Figure 5). While the SA-MDP and MNP
benchmark attacks achieve higher unadjusted scores, their high detection rates result in significantly
lower adjusted scores.


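Constructing an unbiased estimator of $H(\cdot)$ is known to be non-trivial (Shalev et al., 2022).
However, we note that the victim (and adversary) have access to a large number of samples from
$\rho_v(\cdot)$, and, in the case of the adversary, $\rho_v(\cdot, \nu)$. In this work, we employ a
simple, but highly scalable estimator. Jensen’s inequality (1906) yields

$$H[\rho_v(\cdot), \rho_v(\cdot,\nu)] = -\mathbb{E}_{\rho_v(\cdot)}\left[\log A + \log \mathbb{E}_{s_0 \ldots s_T}[B]\right] \le -\mathbb{E}_{\rho_v(\cdot)}\left[\log A\right] - \mathbb{E}_{\rho_v(\cdot)}\, \mathbb{E}_{s_0 \ldots s_T}\left[\log B\right], \tag{8}$$

where $B = \nu(o_0|s_0) \prod_{t=1}^{T} \nu(o_t|s_t, o_{<t}, a_{<t})$. The term involving $A$ does
not depend on $\nu$. For the $\nu$-dependent term, this yields the upper-bound Monte-Carlo estimator

$$\hat{H}[\rho_v(\cdot), \rho_v(\cdot,\nu)] = -\frac{1}{N} \sum_{i=1}^{N} \left[\log \nu(o_0^i|s_0^i) + \sum_{t=1}^{T} \log \nu(o_t^i|s_t^i, o_{<t}^i, a_{<t}^i)\right], \tag{9}$$

where $(o_t, a_t)^i \overset{\text{i.i.d.}}{\sim} \rho_v(\cdot)$,
$s_0^i \overset{\text{i.i.d.}}{\sim} p(\cdot|\emptyset)$, and
$s_{t>0}^i \overset{\text{i.i.d.}}{\sim} p$.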
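
A minimal sketch of such a Monte-Carlo upper bound is given below for the one-step example, where there is no action history and the estimator reduces to averaging $-\log \nu(o_0|s_0)$ over independently drawn observations and states. The attack table `NU`, the sample count, and the function names are hypothetical; a stochastic attack with full support is assumed so that all log-probabilities are finite.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# One sample: (observations o_0..o_T from the unattacked process,
#              states s_0..s_T drawn from the unattacked dynamics).
Sample = Tuple[Sequence[int], Sequence[int]]


def cross_entropy_upper_bound(samples: List[Sample],
                              log_nu: Callable[[int, int], float]) -> float:
    """-(1/N) sum_i sum_t log nu(o_t^i | s_t^i). The history arguments
    (o_<t, a_<t) of nu are dropped here as a simplifying assumption."""
    total = sum(log_nu(o, s) for obs, states in samples for o, s in zip(obs, states))
    return -total / len(samples)


P_S = {1: 1 / 3, 2: 2 / 3}                       # assumed unattacked initial-state distribution
NU = {1: {1: 0.1, 2: 0.9}, 2: {1: 0.5, 2: 0.5}}  # hypothetical attack, indexed as NU[s][o]

rng = random.Random(0)


def draw_state() -> int:
    return rng.choices([1, 2], weights=[P_S[1], P_S[2]])[0]


# Observations follow the unattacked process (o_0 = s_0); states are drawn independently.
samples = [([draw_state()], [draw_state()]) for _ in range(20000)]
estimate = cross_entropy_upper_bound(samples, lambda o, s: math.log(NU[s][o]))

# Exact cross-entropy for comparison: -sum_o P_S(o) log q(o), q(o) = sum_s P_S(s) nu(o|s).
q = {o: sum(P_S[s] * NU[s][o] for s in P_S) for o in (1, 2)}
exact = -sum(P_S[o] * math.log(q[o]) for o in (1, 2))
print(estimate, exact)
```

The estimate exceeds the exact cross-entropy here, as expected from Jensen's inequality. The gap is the price paid for replacing the nested expectation with a single average.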


5 EMPIRICAL EVALUATION OF ILLUSORY ATTACKS 


We illustrate illusory attacks in a simple stochastic MDP (see Fig. 2), where we show that our
optimization algorithm allows precise control of the KL divergence between the trajectory
distributions of the attacked and unattacked environment. We then conduct an extensive evaluation
of illusory attacks in standard high-dimensional RL benchmark environments (2021). We first
empirically demonstrate the ineffectiveness of state-of-the-art robustification methods for large
perturbation budgets $B$ (see Sec. 3.2). However, we show that state-of-the-art
out-of-distribution detectors can readily detect such attacks, rendering them ineffective. In
contrast, we show that ε-illusory attacks with large perturbation budgets can be effective, yet
undetectable. This demonstrates that ε-illusory attacks can be more performant than existing
attacks against victims with state-of-the-art anomaly detectors. In an IRB-approved study, we
demonstrate that humans efficiently detect state-of-the-art observation-space adversarial attacks
on simple control environments, but are considerably less likely to detect ε-illusory attacks
(Section 5.0.1). We provide videos on the project web page at
https://tinyurl.com/illusory-attacks


Experimental setup. We consider the simple stochastic MDP explained in Figure 2 and the four
standard benchmark environments CartPole, Pendulum, Hopper and HalfCheetah (see Figure 6 in
the Appendix), which have continuous state spaces whose dimensionalities range from 1 to 17, as
well as continuous and discrete action spaces. The mean and standard deviations of both detection
and performance results are estimated from 200 independent episodes for each of 5 random seeds.
Victim policies are pre-trained in unattacked environments, and frozen during adversary training.
We assume the adversary has access to the unattacked environment’s state-transition function $p$.


Precisely controlling trajectory KL divergence. Using an exact implementation of Equation 5,
we learn ε-illusory attacks for the single-step MDP environment pictured in Figure 2. As can be
seen in Figure 3, the measured $\mathrm{KL}(\rho_v(\cdot)\|\rho_v(\cdot,\nu))$ at convergence is
bounded tightly by $\epsilon$ until it hits the divergence value for the unconstrained adversarial
attack at ca. $\epsilon = 0.11$. The adversary’s return increases with increasing $\epsilon$ until
it reaches the return of the unconstrained attack at 0.0.

Figure 5: […] y-axis. We see that both the automated detector as well as human subjects are able
to detect SA-MDP and MNP attacks, while ε-illusory attacks are less likely to be detected.

Algorithm 1 […]
    Update $\nu$ using $(s, o, r, s_{\mathrm{new}})$.
    $\lambda = \max(0, \lambda + \alpha(D_{KL} - \epsilon))$.
end for


Effectiveness of state-of-the-art robustification methods under large-budget attacks. We first
investigate the effectiveness of different robustification methods against a variety of adversarial
attacks, considering randomized smoothing (Kumar et al., 2021) and adversarial pretraining (ATLA,
Zhang et al., 2021a), for budgets $B \in \{0.05, 0.2\}$. We compare the performance improvement
under adversarial attacks of each method relative to the performance without robustification. For an
attack budget $B = 0.05$, we find that randomized smoothing results in an average improvement of
61%, while adversarial pretraining results in a 10% performance improvement. However, for the
large attack budget $B = 0.2$, both only result in average performance improvements of 15% and
8%, respectively (see Appendix A.5 for details).


5.0.1 COMPARATIVE EVALUATION OF ILLUSORY ATTACKS 


Setup. For all four evaluation environments, we implement perfect illusory attacks (see Def. 4.2)
by first constructing an attacked initial state distribution $p(\cdot|\emptyset)$ that exploits
environment-specific symmetries. We then sample the initial attacked observations $o_0$ from the
attacked initial state distribution and generate subsequent transitions using the unattacked state
transition function $p(\cdot|o_{t-1}, a_{t-1})$, where $a_{t-1}$ is the action taken at the last
time step (see Appendix A.6 for details). In contrast to perfect illusory attacks, ε-illusory
attacks are learned end-to-end using reinforcement learning. For this, we use a practical variant
of the illusory dual ascent objective and estimate the KL divergence in accordance with the
single-sample estimate of the MC estimate defined in Eq. 9 (see Algorithm 1 and Appendix A.7). We
estimate $D_{KL}$ in Algorithm 1, i.e. the penalty term used to update the dual parameter
$\lambda$, as the sliding window average of the $D_{KL}$ estimate defined in Equation 6 using a
single-sample estimate (see Eq. 9). We equip the victim agent with the state-of-the-art
out-of-distribution detector introduced by (2023), which is trained on trajectories of the
unattacked environment. This detector provides anomaly scores which we use to establish a CUSUM
decision rule tuned to achieve a false positive rate of 3%. We adjust the ε-illusory threshold to
the empirical sensitivity of the detector on each environment. We consider attack budgets (see
Sec. 3.2) $B = 0.05$ and $B = 0.2$, but focus on $B = 0.2$ in this analysis (see the Appendix
for all results); to ensure a fair comparison, we also apply the attack budget to ε-illusory attacks.
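
As a rough illustration of how anomaly scores can be turned into a CUSUM decision rule, the sketch below implements a standard one-sided CUSUM test. The `drift` and `threshold` values and the score sequences are hypothetical; in practice the threshold would be tuned on unattacked trajectories to reach the desired false positive rate, and the exact rule used in the experiments may differ.

```python
from typing import Iterable, Optional


def cusum_alarm(scores: Iterable[float], drift: float, threshold: float) -> Optional[int]:
    """Return the first time step at which the one-sided CUSUM statistic exceeds
    the threshold, or None if no alarm is raised."""
    stat = 0.0
    for t, score in enumerate(scores):
        # Accumulate evidence above the expected score level; never drop below zero.
        stat = max(0.0, stat + score - drift)
        if stat > threshold:
            return t
    return None


clean = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1]     # hypothetical scores, unattacked episode
attacked = [0.1, 0.2, 0.9, 1.1, 1.0, 1.2]  # hypothetical scores, attack starts at t = 2

print(cusum_alarm(clean, drift=0.5, threshold=1.2))
print(cusum_alarm(attacked, drift=0.5, threshold=1.2))
```

Because the statistic accumulates over time, a sustained moderate increase in anomaly scores triggers an alarm even when no single score is extreme.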


Adversary performance against victim agents with automated detectors. We investigate the
adversaries’ relative performance in comparison to state-of-the-art adversarial attacks, specifically
SA-MDP attacks and MNP attacks (Kumar et al., 2021). MNP attacks can only be implemented in
CartPole, which has a discrete action space. We define the scalar adversary score as the resultant
reduction in the victim’s return, normalized relative to both the highest adversarial return in
each class, as well as the victim’s expected return in the unattacked environment. We simulate
contingency actions by setting the adversary’s return to zero across episodes classified as
attacked. This reflects a middle ground across different scenarios in which adversary detection
could trigger victim contingency options ranging from no action to test-time termination, major
security escalation, and adversary persecution.

Published as a conference paper at ICLR 2024

As detailed in Figure 5, the detector detects MNP and SA-MDP attacks with a probability close to
100%. In contrast, the detector classifies ε-illusory attacks as adversarially attacked with very
low probability. Accordingly, the full columns in Figure 4 show that ε-illusory attacks result in
the highest average adversary scores when adjusting for detection. In contrast, detection-adjusted
adversary scores for state-of-the-art attacks are close to zero, which is expected due to their
high empirical detectability (see Fig. 5).
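
The detection adjustment described above can be sketched in a few lines. The function below zeroes the adversary's return in every episode flagged by the detector and then averages. The normalisation step is omitted, and the per-episode numbers are hypothetical placeholders rather than experimental results.

```python
from typing import Sequence


def detection_adjusted_return(adv_returns: Sequence[float],
                              detected: Sequence[bool]) -> float:
    """Mean adversary return, with the return set to zero in detected episodes."""
    adjusted = [0.0 if flag else ret for ret, flag in zip(adv_returns, detected)]
    return sum(adjusted) / len(adjusted)


# Hypothetical episodes: a strong attack that is detected in three of four episodes.
adv_returns = [1.0, 0.8, 0.6, 1.0]
detected = [True, True, False, True]
print(detection_adjusted_return(adv_returns, detected))
```

An attack with a high raw return but a high detection rate therefore ends up with a low adjusted score, which is the effect visible for the benchmark attacks.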


Detection of adversarial attacks by human inspection. We perform a controlled study with
n = 10 human participants to investigate whether humans unfamiliar with adversarial attacks can 
detect adversarial attacks in simple and easy-to-understand environments. We found CartPole and 
Pendulum, in contrast to Hopper and HalfCheetah, to be immediately accessible to participants with- 
out the need for additional training. Participants were first shown an unattacked introduction video 
for both CartPole and Pendulum, exposing them to environment-specific dynamics. Participants 
were then shown a random set of videos containing both videos of unattacked and attacked trajec- 
tories. For each video, participants were asked to indicate whether they believed that the video was 
unsuspicious, with the prompt ‘the system shown in the video was [not] the same as the one from 
the introduction video’. This phrasing was chosen so that participants would not be primed on the 
concept of illusory attacks (see details in Appendix A.8).


We found that participants classified MNP and SA-MDP attacks as suspicious with high accuracy
(see Human detection in Figure 5). In contrast, participants were almost equally likely to classify
videos of unattacked and ε-illusory attacked trajectories as unsuspicious. In fact, at a confidence
level of 95%, the hypothesis ‘participants are equally likely to classify an unattacked sequence as
attacked as to classify an ε-illusory attacked sequence as attacked’ cannot be rejected. Our findings
suggest that humans are unable to detect ε-illusory attacks from short observation sequences in our
simple environments. See Appendix A.8 for full results and the corresponding z-test statistic.
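
A hypothesis of this kind can be examined with a two-proportion z-test. The sketch below assumes a pooled two-sided test, which may differ from the exact test reported in the appendix, and the counts are hypothetical placeholders, not the study's data.

```python
import math


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """z statistic for H0: both groups are labelled 'attacked' with equal probability."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


# Hypothetical counts: videos labelled 'attacked' out of videos shown, per condition.
z = two_proportion_z(k1=12, n1=50, k2=15, n2=50)
reject_at_95 = abs(z) > 1.96  # two-sided critical value for a 95% confidence level
print(z, reject_at_95)
```

With these placeholder counts the statistic stays inside the critical region, so the hypothesis of equal labelling rates would not be rejected.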


6 CONCLUSION AND FUTURE WORK 


This paper introduces a novel class of observation-space adversarial attacks, illusory attacks, which 
admit an information-theoretically grounded notion of statistical detectability. We show the
effectiveness and scalability of our approach against both humans and AI agents with access to
state-of-the-art anomaly detectors across a variety of benchmarks.


We expect the potential positive impact of our work to outweigh the potential negative consequences 
as our work contributes to the design of secure cyber-physical systems. However, it should be
acknowledged that we assume the availability of contingency options for victim agents, which may not
always hold true in real-world scenarios. Moreover, our experimental investigations are confined to 
simulated environments, necessitating further exploration in more intricate real-world domains. 


Future research should conduct a comprehensive theoretical analysis of the Nash equilibria within the
two-player zero-sum game introduced by the illusory attack framework. Furthermore, efforts are 
required to develop more effective defenses against adversarial attacks applicable to real-world en- 
vironments, including (1) improved detection mechanisms, (2) robustified policies that incorporate 
detectors, and (3) improved methods to harden observation channels against adversarial attacks. An 
equally significant aspect of detection is gaining a deeper understanding of the human capability to 
perceive and identify (illusory) adversarial attacks. We ultimately aim to demonstrate the viability 
of illusory attacks and the corresponding defense strategies in real-world settings, particularly in 
mixed-autonomy scenarios. 


Reproducibility. We are committed to promoting reproducibility and transparency in our
research. To facilitate the reproduction of our results, we release the code on our project page.
This code reimplements illusory attacks in JAX (Bradbury et al.). We provide detailed overviews
for all steps of the experiments in the Appendix, where we also link to the publicly available
code repositories that our work uses.

A APPENDIX


A.1 ADDITIONAL RELATED WORK


Assuming a different black-box setting, (2019) introduce a class of adversaries for which a
unique mask is precomputed and added to the agent observation at every time step. Our framework differs
from these previous works in that it preserves consistency across trajectories of observation sequences.
proposes adversarial attacks motivated by a notion of imperceptibility measured in policy network
activation space. One major difference is that the paper focuses on per-state imperceptibility, while our work
focuses on information-theoretic undetectability, which hence requires focusing on whole trajectories.


develop an action-conditioned frame module that allows agents to detect adversarial attacks
by comparing the module’s action distribution with the realised action distribution. (2021)
detect adversaries by evaluating the feasibility of past action sequences. (2019); Sun et al. (2020b);
Huang & Zhu (2019); Korkmaz & Brown-Cohen (2023) focus on the detectability of adversarial attacks but
without considering notions of stochastic equivalence between observation processes.

Attack targets include cameras (Eykholt et al., 2018), LIDAR (Sun et al., 2020a; Cao et al., 2019;
Tu et al.), and multi-sensor fusion mechanisms (Abdelfattah et al., 2021).


A.2 POMDP CORRESPONDENCE 


We begin this section by defining standard POMDP notation. In a partially observable MDP (1965;
Kaelbling et al.; POMDP) $\langle S, A, \Omega, O, p, r, \gamma \rangle$, the agent does not directly
observe the system state $s_t$ but instead receives an observation $o_t \sim O(\cdot|s_t)$, where
$O : S \mapsto P(\Omega)$ is an observation function and $\Omega$ is a finite non-empty observation
space. The canonical embedding $\mathrm{pomdp} : \mathfrak{M} \hookrightarrow \mathfrak{P}$ from the set
of finite MDPs $\mathfrak{M}$ to the family of POMDPs $\mathfrak{P}$ maps $\Omega \mapsto S$ and sets
$O(s) = s$, $\forall s \in S$. In a POMDP, the agent acts on a policy
$\pi : \mathcal{H}^*_{\setminus r} \mapsto P(A)$, growing a history $h_{t+1} = h_t a_t o_{t+1} r_{t+1}$
from a set of histories $\mathcal{H}^t := (A \times \Omega \times \mathbb{R})^t$, where
$\mathcal{H}^* := \bigcup_t \mathcal{H}^t$ denotes the set of all finite histories. We denote histories
(or sets of histories) from which reward signals have been removed as $(\cdot)_{\setminus r}$.

In line with standard literature (1982), we distinguish between two stochastic processes that are
induced by pairing a POMDP with a policy $\pi$: the core process, which is the process over state random
variables $\{S_t\}$, and the observation process, which is induced by observation random variables
$\{O_t\}$. The frequentist agent’s goal is then to find an optimal policy $\pi^*$ that maximizes the total
expected discounted return, i.e. $\pi^* = \arg\sup_{\pi \in \Pi} \mathbb{E}_{h \sim P_\pi} \left[ \sum_t \gamma^t r_t \right]$,
where $\Pi := \{\pi : \mathcal{H}^*_{\setminus r} \mapsto P(A)\}$ is the set of all policies.


Now consider a POMDP $\mathcal{E}_\circ := \langle S', A, \Omega, O', p', r, \gamma \rangle$ with finite
horizon $T$, a state space $S' := (S \times A \times \Omega)^T$, deterministic observation function
$O' : S' \mapsto \Omega$, and stochastic state transition function $p' : S' \times A \mapsto P(S')$.
Then, for any $\pi_v : \mathcal{H}^*_{\setminus r} \mapsto P(A)$ and
$\nu : S \times \mathcal{H}^*_{\setminus r} \mapsto P(\Omega)$, we can define corresponding $p'$ and $O'$
such that the reward and observation processes cannot be distinguished by the victim. We now proceed to
the formal theorem.


Theorem A.1 (POMDP Correspondence). For any $\mathcal{E}^\nu$, there exists a corresponding POMDP
$\mathcal{E}_\circ(\mathcal{E}^\nu)$ for which the victim’s learning problem is identical.


Proof. Recall that the semantics of $\mathcal{E}^\nu$ are as follows: Fix a victim policy
$\pi : \mathcal{H}^*_{\setminus r} \mapsto P(A)$ from the space of all possible sampling policies $\Pi$.
At time $t = 0$, we sample an initial state $s_0 \sim p(\cdot|\emptyset)$. The adversary then samples an
observation $o_0 \sim \nu(\cdot|s_0)$ which is emitted to the victim. The victim takes an action
$a_0 \sim \pi(\cdot|o_0)$, upon which the state transitions to $s_1 \sim p(\cdot|s_0, a_0)$ and the victim
receives a reward $r_1 \sim r(\cdot|s_0, a_0)$. At time $t > 0$, the victim has accumulated a history
$h_t := o_0 a_0 r_1 \ldots o_t$, on which $o_t \sim \nu(\cdot|s_t, h_{t \setminus r})$ conditions.

Define $p'$ as the following sequential stochastic process: At time $t = 0$, first sample
$s_0 \sim p(\cdot|\emptyset)$. Then sample $o_0 \sim \nu(\cdot|s_0)$, and define
$s'_0 := p'(\emptyset) := (s_0, o_0)$. For any $t > 0$, first sample $s_t \sim p(\cdot|s_{t-1}, a_{t-1})$,
then $o_t \sim \nu(\cdot|s_{\leq t}, a_{<t}, o_{<t})$, and define $s'_t := p'(s'_{t-1}, s_t, o_t, a_{t-1})$.
We finally define $O'(s'_t) := \mathrm{proj}_o(s'_t) := o_t$, where we indicate that $o_t$ is stored in
$s'_t$ by using an explicit projection operator $\mathrm{proj}_o$. Clearly, under any sampling policy
$\pi$, the observation and reward processes induced by $\mathcal{E}_\circ$ and $\mathcal{E}^\nu$ are
identical as $T \to \infty$. This renders the reward and observation processes identical in both
environments. Note that, as $T \to \infty$, $\mathcal{E}_\circ$’s state space grows infinitely large.


In other words, Theorem A.1 implies that, given enough memory (Yu & Bertsekas, 2008), the adversary can
be chosen such that the state space of $\mathcal{E}_\circ(\mathcal{E}^\nu)$ becomes arbitrarily large due
to its infinite horizon. This renders the worst-case problem of finding an optimal victim policy in
$\mathcal{E}_\circ(\mathcal{E}^\nu)$ intractable even if the adversary’s policy is known (2005; 2016). The
underlying game G, therefore, assumes an infinite state space, preventing recent progress in solving
finite-horizon extensive-form games (Kovařík et al., 2022; McAleer et al., 2023; 2023) from being
leveraged in characterizing its Nash equilibria. We now give a proof by construction.

Figure 6: Benchmark environments used for empirical evaluation, from left to right. In CartPole,
the agent has to balance a pole by moving the black cart. In Pendulum, the agent has to apply a
torque action to balance the pendulum upright. In Hopper and HalfCheetah, the agent has to choose
high-dimensional control inputs such that the agent moves towards the right of the image.


A.3 ON THE DIFFICULTY OF ESTIMATING THE ILLUSORY OBJECTIVE 


We note that estimating the illusory objective is, in general, difficult. Even when choosing a nonparametric
kernel with optimal bandwidth, the risk of conditional density estimators increases as
$O(N^{-\frac{4}{4+d}})$ with support dimensionality $d$ (Wasserman, 2006; Grünewalder et al., 2012;
Fellows et al., 2023). This is aggravated by KL-estimation being a nested estimation problem
(Rainforth et al., 2018).
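To make the scaling concrete, the following minimal sketch evaluates the rate $N^{-4/(4+d)}$ and inverts it to obtain the number of samples needed for a fixed risk level. It is an illustration only: the function names are hypothetical, all constants hidden by the $O(\cdot)$ notation are assumed to be 1, and the chosen dimensionalities are arbitrary rather than taken from the experiments.

```python
def kde_risk_rate(n_samples: int, dim: int) -> float:
    """Order of the estimator risk, N^(-4/(4+d)), with constants assumed to be 1."""
    return n_samples ** (-4.0 / (4.0 + dim))


def samples_for_risk(target_risk: float, dim: int) -> float:
    """Invert the rate: N = target_risk^(-(4+d)/4), with constants assumed to be 1."""
    return target_risk ** (-(4.0 + dim) / 4.0)


# Number of samples needed to reach the same nominal risk as d grows.
for d in (1, 4, 16):
    print(f"d={d:2d}: N ~ {samples_for_risk(0.1, d):.3g}")
```

Under these assumptions, the required sample size grows exponentially in $d$ for a fixed target risk, which is why direct estimation of the objective quickly becomes impractical.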


While the estimator bias may be further reduced by using a more sophisticated nested estimation method
such as a multi-level MC estimator (Naesseth et al., 2015), and by performing improved estimates for
$p_\pi(\cdot, \nu)$ using variational inference (Blei et al., 2017; VI), or sequential Monte-Carlo
(2001; SMC), these methods come with increased computational complexity.


A.4 DETECTOR AND DECISION RULE USED IN EXPERIMENTS 


We implement the out-of-distribution detector proposed by (2023) using the implementation
provided by the authors. As this detector provides anomaly scores at every time step but does not provide
a decision rule for classifying a distribution as attacked, we implement a CUSUM (1954) decision rule
based on the anomaly scores observed at test time and the mean anomaly score for a held-out test set
of unattacked episodes. We train the detector on unperturbed environment interactions, using the configuration
provided by the authors. We then tune the CUSUM decision rule such that a per-episode false positive rate
of 3% is achieved. We assess the accuracy of detecting adversarial attacks across all scenarios presented in


Table 
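The decision rule described above can be sketched as follows. This is a minimal one-sided CUSUM under stated assumptions, not the configuration used in the experiments: the function names, the `drift` parameter, and the quantile-based threshold tuning are illustrative choices, and the anomaly scores are assumed to be available as per-episode arrays.

```python
import numpy as np


def cusum_peak(scores, reference_mean, drift=0.0):
    """Largest value reached by the one-sided CUSUM statistic over an episode."""
    stat, peak = 0.0, 0.0
    for score in scores:
        # Accumulate positive deviations from the unattacked mean; reset at zero.
        stat = max(0.0, stat + (score - reference_mean - drift))
        peak = max(peak, stat)
    return peak


def cusum_alarm(scores, reference_mean, drift=0.0, threshold=1.0):
    """Flag an episode as attacked if the CUSUM statistic ever exceeds the threshold."""
    return cusum_peak(scores, reference_mean, drift) > threshold


def tune_threshold(clean_episodes, reference_mean, drift=0.0, target_fpr=0.03):
    """Pick the threshold as the (1 - target_fpr) quantile of peaks on unattacked episodes."""
    peaks = np.array([cusum_peak(ep, reference_mean, drift) for ep in clean_episodes])
    return float(np.quantile(peaks, 1.0 - target_fpr))
```

Here `reference_mean` would be the mean anomaly score on held-out unattacked episodes, and `tune_threshold` would be called on those same episodes to fix the per-episode false positive rate before evaluating attacked episodes.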


A.5 ROBUSTIFICATION 


We implement the ATLA (Zhang et al., 2021a) victim by co-training it with an adversary agent, following
the authors’ original implementation. We implement randomized smoothing as a standard defense
against adversarial attacks on RL agents, as introduced in (2021), using the authors’ original
implementation. See Table 1 for results.


A.6 PERFECT ILLUSORY ATTACKS IMPLEMENTATION 


We implement perfect illusory attacks as detailed in Algorithm 2. The first observation $o_0$ is set to the
negative of the true first state sampled from the environment. Note that in HalfCheetah and Hopper the initial
state distribution is not centered around the origin. We hence first subtract the offset, then compute the
negative of the observation, and finally add the offset again. As the distribution over initial states is
symmetric in all environments (after removing the offset), this approach satisfies the conditions of a perfect
illusory attack (see Definition 4.2).


https://github.com/FraunhoferIKS/pedm-ood

https://github.com/huanzhang12/ATLA_robust_RL

https://openreview.net/forum?id=mwdfai8NBrJ

Algorithm 2 Perfect illusory adversarial attack

Input: environment env, environment transition function $t$ whose initial state distribution
$p(\cdot|\emptyset)$ is symmetric with respect to the point $p_{symmetry}$ in $S$, victim policy $\pi_v$.

    $k = 0$
    $s_0$ = env.reset()
    $o_0 = -(s_0 - p_{symmetry}) + p_{symmetry}$
    $a_0 = \pi_v(o_0)$
    _, done = env.step($a_0$)
    while not done do
        $k = k + 1$
        $o_k \sim t(o_{k-1}, a_{k-1})$
        $a_k = \pi_v(o_k)$
        _, done = env.step($a_k$)
    end while
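As a hedged illustration of Algorithm 2, the sketch below runs the attack on a hypothetical one-dimensional environment. `ToyEnv`, its dynamics, and the proportional victim policy are assumptions made for this example only and are not part of the benchmarks or the released code; the attack logic itself follows the listed steps (reflect the initial state through the symmetry point, then roll the observation forward with the transition function).

```python
import numpy as np


class ToyEnv:
    """Hypothetical 1-D environment with a symmetric initial state distribution."""

    def __init__(self, offset=2.0, horizon=20, seed=0):
        self.offset = offset
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    @staticmethod
    def transition(state, action):
        # Deterministic dynamics, shared by the real and the illusory rollout.
        return state + 0.1 * action

    def reset(self):
        self.steps = 0
        # Initial state is uniform on [offset - 1, offset + 1], symmetric around offset.
        self.state = self.offset + self.rng.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        self.state = self.transition(self.state, action)
        self.steps += 1
        return self.state, self.steps >= self.horizon


def perfect_illusory_rollout(env, transition, victim_policy, p_symmetry):
    """Return the observations shown to the victim and the true states."""
    state = env.reset()
    obs = -(state - p_symmetry) + p_symmetry  # reflect through the symmetry point
    observations, true_states = [obs], [state]
    action = victim_policy(obs)
    state, done = env.step(action)
    true_states.append(state)
    while not done:
        obs = transition(obs, action)  # illusory observation follows the env dynamics
        observations.append(obs)
        action = victim_policy(obs)
        state, done = env.step(action)
        true_states.append(state)
    return np.array(observations), np.array(true_states)
```

In this toy setting, a victim that steers its observed state towards the offset sees a trajectory that is consistent with the environment dynamics, while the true state is pushed away from the offset.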


A.7 LEARNING €-ILLUSORY ATTACKS WITH REINFORCEMENT LEARNING 


We next describe the algorithm used to learn ε-illusory attacks and the training procedures used to compute
the results in Table. We use the CartPole, Pendulum, HalfCheetah and Hopper environments as given
in Brockman et al. (2016). We shortened the episodes in Hopper and HalfCheetah to 300 steps to speed up
training. The transition function is implemented using the physics engines given in all environments. We
normalize observations by the maximum absolute observation. We train the victim with PPO (2017) and use
the implementation of PPO given in Raffin et al. (2021), without making any changes to the given
hyperparameters. In both environments we train the victim for 1 million environment steps.


We implement the illusory adversary agent with SAC (Haarnoja et al., 2018), where we likewise use the
implementation given in Raffin et al. (2021). We initially ran a small study and investigated four different
algorithms as possible implementations for the adversary agent, where we found that SAC yields the best
performance and training stability. We outline the dual ascent update steps in Algorithm 1 which, like RCPO
(2018), pulls a single-sample approximation of the constraint into the reward objective. We approximate
$D_{KL}$ by taking the mean of the constraint violation $\|o - p(o_{old}, a_{old})\|_2^2$ over the last 50
time steps. We further ran a small study over hyperparameters $\alpha \in \{0.01, , 0.1, 1\}$ and the initial
value for $\lambda \in \{10, 100\}$ and chose the best performing combination. We train all adversarial
attacks for four million environment steps.
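The sketch below illustrates how a constraint of this kind can be pulled into the adversary's reward with a Lagrange multiplier that is updated by dual ascent. It is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: the class and method names are hypothetical, $\alpha$ is assumed to be the multiplier step size, and the multiplier update $\lambda \leftarrow \max(0, \lambda + \alpha(\hat{D} - \epsilon))$ is a generic dual ascent step chosen for illustration.

```python
from collections import deque

import numpy as np


class DualAscentPenalty:
    """Hypothetical helper: penalized adversary reward plus a multiplier update."""

    def __init__(self, epsilon, lam_init=10.0, alpha=0.1, window=50):
        self.epsilon = epsilon  # allowed detectability budget
        self.lam = lam_init     # Lagrange multiplier (initial value is a hyperparameter)
        self.alpha = alpha      # assumed to be the step size of the multiplier update
        self.violations = deque(maxlen=window)  # last `window` constraint violations

    def penalized_reward(self, adversary_reward, obs, predicted_obs):
        """Single-sample constraint term: squared l2 distance between the emitted
        observation and the observation predicted by the transition function."""
        diff = np.asarray(obs, dtype=float) - np.asarray(predicted_obs, dtype=float)
        violation = float(np.sum(diff ** 2))
        self.violations.append(violation)
        return adversary_reward - self.lam * violation

    def update_multiplier(self):
        """Dual ascent step on the mean violation over the recent window."""
        if not self.violations:
            return self.lam
        estimate = float(np.mean(self.violations))
        self.lam = max(0.0, self.lam + self.alpha * (estimate - self.epsilon))
        return self.lam
```

In a training loop, `penalized_reward` would replace the raw adversary reward passed to the RL algorithm, and `update_multiplier` would be called periodically so that the penalty weight grows while the estimated violation exceeds the budget and shrinks otherwise.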


[Figure 7: bar chart of the detection rate in percent under automated detection, averaged across
environments, for the ε-illusory attack (ours), the SA-MDP attack, and the MNP attack (only applicable
in CartPole).]

Figure 7: Detection results for B = 0.05. Different adversarial attacks are shown on the x-axis, with
detection rates on the y-axis. We see that the automated detector reliably detects SA-MDP and MNP
attacks, while ε-illusory attacks are less likely to be detected.

Table 1: Adversary scores under different attacks and defenses. Entries are normalized adversary
scores [%].

Budget B
0.05        3±7     64±6    -
0.2         87±6    72±3    7±6

Computational overhead of ε-illusory attacks. Note that there is no computational overhead of our
method at test-time. We found in our experiments that the computational overhead during training of the
adversarial attack scaled with the quality of the learned attack. In general, we found that the training wall-clock
time for the ε-illusory attack results presented in Table 1 was about twice that of the SA-MDP attack
(note that MNP attacks and perfect illusory attacks do not require training).

A.7.1 RESULTS FOR PERTURBATION BUDGET 0.05


We show the remaining results for a perturbation budget of B = 0.05 in Figures 8 and 7. Note that the
corresponding figures in the main paper are for a perturbation budget of B = 0.2.


[Figure 8: four panels (Pendulum, CartPole, HalfCheetah, Hopper) showing the adversary score for the
ε-illusory attack (ours), the perfect illusory attack (ours), the SA-MDP attack, and the MNP attack (only
applicable in CartPole). Bars show the raw adversary score and the adversary score adjusted for detection.]

Figure 8: Results for B = 0.05. We display normalised adversary scores, indicating the reduction in
the victim’s reward, on the y-axis. Each plot shows results in a different environment, with different
adversarial attacks on the x-axis. We show both the raw adversary score and the adversary
score adjusted for the detection rates of the different adversarial attacks (see Figure 7). While the SA-MDP
and MNP benchmark attacks achieve higher unadjusted scores, their high detection rates result in
significantly lower adjusted scores. Note that MNP attacks perform significantly worse for B = 0.05
than for B = 0.2 (see Figure 4).


Table 2: Full results table for all four environments.

Attack                            Budget B   Detection rate   Victim reward

Pendulum
SA-MDP (Zhang et al., 2021a)      0.05       76.3±0.05        -797.2±69.9
ε-illusory attack (ours)          0.05       0±0              -524.1±104.3
SA-MDP (Zhang et al., 2021a)      0.2        100±0.03         -1387.0±119.0
ε-illusory attack (ours)          0.2        3.6±0.02         -980.0±84.0
Perfect illusory attack (ours)    1          3.0±0.02         -1204.8±88.6
unattacked                                   3.2±0.03         -189.4

CartPole
MNP (Kumar et al., 2021)          0.05       86.9±0.3         485.0±33.5
SA-MDP (Zhang et al., 2021a)      0.05       80.5±0.8         9.4±0.2
ε-illusory attack (ours)          0.05       1.5±0.02         12.9±0.3
MNP (Kumar et al., 2021)          0.2        100±0            18.3±20.8
SA-MDP (Zhang et al., 2021a)      0.2        100±0            9.3±0.1
ε-illusory attack (ours)          0.2        3.7±0.01         11.0±0.5
Perfect illusory attack (ours)    1          3.1±0.01         30.1±2.2
unattacked                                   3.2±0.01         500.0

HalfCheetah
SA-MDP (Zhang et al., 2021a)      0.05       100±0            -1570.8±177.4
ε-illusory attack (ours)          0.05       0±0              -180.8±50.1
SA-MDP (Zhang et al., 2021a)      0.2        100±0            -1643.8±344.8
ε-illusory attack (ours)          0.2        0±0              -240.6±18.0
Perfect illusory attack (ours)    1          2.9±0.04         5.9±36.8
unattacked                                   3.1±0.02         2594.6

Hopper
SA-MDP (Zhang et al., 2021a)      0.05       87.4±0.02        144.1±265.4
ε-illusory attack (ours)          0.05       0±0              209.4±90.8
SA-MDP (Zhang et al., 2021a)      0.2                         -761.5±127.4
ε-illusory attack (ours)          0.2                         -260.9±140.8
Perfect illusory attack (ours)    1          3.1±0.02         679.2±63.9
unattacked                                   2.8±0.08         958.1


A.8 HUMAN STUDY 


Study approval. Our study was approved by an independent ethics committee (reference R84123/RE001).

Table 3: Results from our study with human participants. All entries are in percent.

                                  Environment
                                  both          Pendulum      CartPole
P(false | no attack)              34.2±11.4     31.5±10.5     37.0±12.3
P(false | SA-MDP)                 81.4±27.2     96.3±32.1     66.7±22.2
P(false | ε-illusory attack)      32.4±10.8     37.0±12.3     27.7±9.3
P(false | MNP)                    83.3±27.8     -             83.3±27.8


Setup. We performed a controlled study with n = 10 human participants. All participants were graduate- 
level university students. None had prior knowledge about the objective of the study. Participants participated 
voluntarily; we estimate the time needed per participant to be around 15 minutes. Participants were handed a 
slide show which contained all relevant information. This slide show is included in the supplementary material 
in the respective folder. We further add the sheet with ground truth labels for all video sequences. 


After consenting to participate, participants were provided with the slide show and an online sheet to indicate
their answers. The study was self-paced and completed by the participants without further supervision. We further
advised participants not to continue with the study if they felt uncomfortable at any point in time.


The first 18 videos in the slide show are 100-frame long sequences of the Pendulum environment, i.e. six
each for either no adversarial attack, an ε-illusory attack (for which we added an $l_\infty$ penalty to the
$l_2$ penalty in Algorithm 1), or an SA-MDP attack (Zhang et al., 2021a). The subsequent 24 10-frame long
sequences of the CartPole environment include the same attacks as for Pendulum, plus an additional six
sequences under an MNP attack (Kumar et al., 2021). Note that MNP attacks cannot be applied to the Pendulum
environment, as it has a continuous action space. For all attacks, the budget B was 0.2. Within both groups,
the order of the videos shown was randomised. Subjects had to choose for each video between correct and false,
where false indicates that subjects find that the system observed in the respective video is not the same as
the one observed in the introductory video. We likewise refer to a video marked as false as a suspicious
video. Table 3 includes all empirically estimated probabilities and standard deviations for classifying a
segment from a given class as false.


Result analysis. We find that subjects reported one-third of unattacked sequences as false. Subjects
classified sequences under an MNP or SA-MDP attack as false with a probability higher than
80%, while categorizing illusory attacks as false with a probability of only 32.4%. We computed the
double-sided z-statistics and were able to reject both the hypothesis that P(false | SA-MDP) = P(false | no
attack) and the hypothesis that P(false | MNP) = P(false | no attack) for α = 0.05, while the hypothesis that
P(false | ε-illusory attack) = P(false | no attack) cannot be rejected. We conclude that subjects were able
to distinguish SA-MDP and MNP attacks from unattacked sequences while being unable to distinguish illusory
attacks from unattacked sequences.
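For readers who want to rerun this kind of comparison, the following is a minimal sketch of a double-sided two-proportion z-test. The pooled-variance form used here is one common variant and is an assumption, as is the function name; the counts in the usage lines are hypothetical placeholders and are not the study's raw data.

```python
import math


def two_proportion_z_test(false_a, n_a, false_b, n_b):
    """Double-sided z-test for H0: P(false | class a) = P(false | class b).

    Uses the pooled estimate of the common proportion under H0.
    Returns the z-statistic and the two-sided p-value.
    """
    p_a, p_b = false_a / n_a, false_b / n_b
    pooled = (false_a + false_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0, 1.0  # both samples are all-false or all-correct
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts of videos marked 'false' out of videos shown per class.
z, p = two_proportion_z_test(false_a=90, n_a=100, false_b=30, n_b=100)
print(f"z = {z:.2f}, reject H0 at alpha = 0.05: {p < 0.05}")
```

The null hypothesis is rejected at level α = 0.05 when the returned p-value falls below 0.05.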


A.9 RUNTIME COMPARISON 


We investigate wall-clock time for training different adversarial attacks. We first recall that MNP attacks
(Kumar et al., 2021) as well as perfect illusory attacks do not require training. For SA-MDP attacks
(Zhang et al., 2021a) and ε-illusory attacks, training time is highly dependent on the complexity of the
environment, with lower training times for the CartPole and Pendulum environments, and higher training times
for the Hopper and HalfCheetah environments. All reported times are measured using an NVIDIA GeForce GTX 1080
and an Intel Xeon Silver 4116 CPU. We trained SA-MDP attacks for 6 hours and 12 hours in the simpler and more
complex environments, respectively. We trained ε-illusory attacks for 10 hours and 20 hours in the simpler and
more complex environments, respectively. At test-time, inference times for ε-illusory attacks are identical to
those of SA-MDP attacks as they only consist of a neural network forward pass. Memory requirements are identical.




